About the project

Our era of data - larger than ever and more complex than ever - demands several skills from statisticians and other data scientists. In this course I am going to learn some new skills for working with data.

In this course I will learn to:

  • use open data science tools
    • R
    • R Markdown
    • GitHub

https://github.com/lottaimmeli/IODS-project


Chapter 2. Regression and model validation

This week I am doing regression analysis and model validation. But first I had to wrangle the original data. That was pretty difficult for me, but it was good practice and I just need to practice a lot more.

Description of the dataset

# After wrangling the data and writing it out to a folder, I could use my own wrangled data for the further analysis...

students2014 <- read.csv(file="~/Documents/GitHub/IODS-project/Data/learning2014.csv", header = TRUE, sep=",")

# Looking at the data structure

str(students2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ Age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ Attitude: int  37 31 25 35 37 38 35 29 38 21 ...
##  $ Points  : int  25 12 24 10 22 21 21 31 24 26 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
head(students2014)
##   gender Age Attitude Points     deep  stra     surf
## 1      F  53       37     25 3.583333 3.375 2.583333
## 2      M  55       31     12 2.916667 2.750 3.166667
## 3      F  49       25     24 3.500000 3.625 2.250000
## 4      M  53       35     10 3.500000 3.125 2.250000
## 5      M  49       37     22 3.666667 3.625 2.833333
## 6      F  38       38     21 4.750000 3.625 2.416667

This dataset has 166 observations (students) and 7 variables: gender, age, global attitude toward statistics (Attitude), exam points (Points) and scores related to three aspects of learning (deep, strategic and surface learning).

A graphical overview of the data and summaries of the variables.

pairs(students2014[-1], col=students2014$gender)

summary(students2014)
##  gender       Age           Attitude         Points           deep      
##  F:110   Min.   :17.00   Min.   :14.00   Min.   : 7.00   Min.   :1.583  
##  M: 56   1st Qu.:21.00   1st Qu.:26.00   1st Qu.:19.00   1st Qu.:3.333  
##          Median :22.00   Median :32.00   Median :23.00   Median :3.667  
##          Mean   :25.51   Mean   :31.43   Mean   :22.72   Mean   :3.680  
##          3rd Qu.:27.00   3rd Qu.:37.00   3rd Qu.:27.75   3rd Qu.:4.083  
##          Max.   :55.00   Max.   :50.00   Max.   :33.00   Max.   :4.917  
##       stra            surf      
##  Min.   :1.250   Min.   :1.583  
##  1st Qu.:2.625   1st Qu.:2.417  
##  Median :3.188   Median :2.833  
##  Mean   :3.121   Mean   :2.787  
##  3rd Qu.:3.625   3rd Qu.:3.167  
##  Max.   :5.000   Max.   :4.333

I think this method is not very good. It's very difficult to interpret the data like this. I need a better graph.

# Use the packages GGally and ggplot2 to get a better graphical overview

library(GGally)
library(ggplot2)

p <- ggpairs(students2014, mapping = aes(alpha = 0.5), lower = list(combo = wrap("facethist", bins = 20)))

p

This is a better view of the data. All the variables (except for age, where skewness > 0) are fairly normally distributed. The plot also shows the correlations between the variables. There does not seem to be a very strong correlation between any of them, since the correlation coefficients are between -0.3 and 0.4.

Making a regression model

#Choose three variables as explanatory variables and fit a regression model where exam points is the target (dependent) variable.

model <- lm(Points ~ gender + Attitude + stra, data=students2014)
summary(model)
## 
## Call:
## lm(formula = Points ~ gender + Attitude + stra, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.7179  -3.3285   0.5343   3.7412  10.9007 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.97982    2.40303   3.737 0.000258 ***
## genderM     -0.22362    0.92477  -0.242 0.809231    
## Attitude     0.35100    0.05956   5.893 2.13e-08 ***
## stra         0.89107    0.54409   1.638 0.103419    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.304 on 162 degrees of freedom
## Multiple R-squared:  0.2051, Adjusted R-squared:  0.1904 
## F-statistic: 13.93 on 3 and 162 DF,  p-value: 3.982e-08

I chose to study the association of exam points (target variable) with gender, attitude and strategic learning (explanatory variables). Here I can see that Attitude is the only one of these three explanatory variables that is statistically significant (p-value < 0.05). The t-value 5.893 also points to its significance.

Next I make a regression model with only the one significant explanatory variable “Attitude”.

model2 <- lm(Points ~ Attitude, data=students2014)
summary(model2)
## 
## Call:
## lm(formula = Points ~ Attitude, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.63715    1.83035   6.358 1.95e-09 ***
## Attitude     0.35255    0.05674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

Here I get the result that the “Attitude” estimate is 0.35 and the p-value stays low (below 0.05, meaning it is significant). In other words, when Attitude increases by 1 unit, the exam points increase by 0.35 on average.
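As a quick sanity check of this interpretation, the fitted line can be used for a prediction by hand. A minimal sketch using the coefficients from the summary above and an assumed Attitude value of 30 (not a value from the data):

```r
# Coefficients taken from the summary(model2) output above
intercept <- 11.63715
slope     <- 0.35255

# Predicted exam points for a hypothetical student with Attitude = 30
predicted_points <- intercept + slope * 30
predicted_points  # about 22.2
```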

Next I need to explain and interpret the multiple R-squared of the model. R-squared is a statistical measure of how close the data are to the fitted regression line, in other words how well the model fits my data.

The definition of R-squared is the percentage of the response variable variation that is explained by a linear model.

R-squared = explained variation / total variation. R-squared is always between 0 and 100%.

Here in the summary I can see that the Multiple R-squared is 0.1906, about 19%, so my model explains only about one fifth of the variation in exam points around their mean.

“In general, the higher the R-squared, the better the model fits my data. R-squared cannot determine whether the coefficient estimates and predictions are biased, which is why you must assess the residual plots.”

“R-squared does not indicate whether a regression model is adequate. You can have a low R-squared value for a good model, or a high R-squared value for a model that does not fit the data!”
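The definition above can be checked in R. This is a self-contained sketch on toy data (not the course data), showing that the Multiple R-squared reported by summary() equals 1 minus the unexplained variation divided by the total variation:

```r
# Toy data for illustration only
set.seed(1)
x <- 1:20
y <- 2 + 0.5 * x + rnorm(20)
fit <- lm(y ~ x)

rss <- sum(residuals(fit)^2)   # unexplained (residual) variation
tss <- sum((y - mean(y))^2)    # total variation around the mean
manual_r2 <- 1 - rss / tss

all.equal(manual_r2, summary(fit)$r.squared)  # TRUE
```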

Diagnostic plots to explain the assumptions of the model

There are several assumptions behind a linear regression model. With the help of the following plots I can analyze the residuals of my model and see how well the linear regression model works here, or whether there are serious problems with it.

plot(model2, which= c(2,1,5))

Residuals vs fitted plot (homoscedasticity)

This plot shows whether the errors/residuals have constant variance. The residuals are spread roughly equally around a horizontal line without distinct patterns. This is a good indication!

Q-Q plot (normality)

With the Q-Q plot I can check whether the residuals are normally distributed. The points lie very close to the line except at the upper and lower tails, where there is some deviation. However, I think this is still reasonable, and I can interpret the errors as approximately normally distributed.

Residuals vs Leverage

This plot helps me find out whether there are outliers in my data that are influential in my linear regression model. In my analysis there are no such cases; all observations are inside the Cook's distance lines (I cannot even see them).


Chapter 3. Logistic regression.

Reading the data

This week I'm learning about logistic regression.

Before this analysis part I have been doing some data wrangling and prepared the data so that I am able to analyse it.

# Opening the data from the file and checking how it looks

alc <- read.csv("~lottaimmeli/Documents/GitHub/IODS-project/Data/alc.csv", header = TRUE, sep = ",")

About the data

This data is about student achievement in secondary education of two Portuguese schools. The data attributes include student grades, demographic, social and school related features and it was collected by using school reports and questionnaires. There were originally two datasets provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por).

In my data set I have combined the information from these two datasets. I have observations of the 382 students who were present in both the Mathematics and the Portuguese dataset. The original dataset had 33 variables. In addition, I have calculated two new variables: ‘alc_use’, the average of weekday and weekend alcohol consumption, and ‘high_use’, which is TRUE if ‘alc_use’ is greater than 2.

You can find more information about the original data set and the variables here: https://archive.ics.uci.edu/ml/datasets/Student+Performance
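The wrangling step that created these two variables can be sketched like this. A minimal illustration on toy rows, assuming the Dalc (weekday) and Walc (weekend) consumption columns from the original data:

```r
# Toy rows standing in for the real Dalc/Walc columns
toy <- data.frame(Dalc = c(1, 3, 5), Walc = c(1, 4, 5))

toy$alc_use  <- (toy$Dalc + toy$Walc) / 2  # average of weekday and weekend use
toy$high_use <- toy$alc_use > 2            # TRUE when the average exceeds 2
toy
```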

dim(alc)
## [1] 382  37
colnames(alc)
##  [1] "school"      "sex"         "age"         "address"     "famsize"    
##  [6] "Pstatus"     "Medu"        "Fedu"        "Mjob"        "Fjob"       
## [11] "reason"      "nursery"     "internet"    "guardian"    "traveltime" 
## [16] "studytime"   "failures"    "schoolsup"   "famsup"      "paid"       
## [21] "activities"  "higher"      "romantic"    "famrel"      "freetime"   
## [26] "goout"       "Dalc"        "Walc"        "health"      "absences"   
## [31] "G1"          "G2"          "G3"          "alc_use"     "high_use"   
## [36] "probability" "prediction"
summary(alc)
##  school   sex          age        address famsize   Pstatus
##  GP:342   F:198   Min.   :15.00   R: 81   GT3:278   A: 38  
##  MS: 40   M:184   1st Qu.:16.00   U:301   LE3:104   T:344  
##                   Median :17.00                            
##                   Mean   :16.59                            
##                   3rd Qu.:17.00                            
##                   Max.   :22.00                            
##       Medu            Fedu             Mjob           Fjob    
##  Min.   :0.000   Min.   :0.000   at_home : 53   at_home : 16  
##  1st Qu.:2.000   1st Qu.:2.000   health  : 33   health  : 17  
##  Median :3.000   Median :3.000   other   :138   other   :211  
##  Mean   :2.806   Mean   :2.565   services: 96   services:107  
##  3rd Qu.:4.000   3rd Qu.:4.000   teacher : 62   teacher : 31  
##  Max.   :4.000   Max.   :4.000                                
##         reason    nursery   internet    guardian     traveltime   
##  course    :140   no : 72   no : 58   father: 91   Min.   :1.000  
##  home      :110   yes:310   yes:324   mother:275   1st Qu.:1.000  
##  other     : 34                       other : 16   Median :1.000  
##  reputation: 98                                    Mean   :1.448  
##                                                    3rd Qu.:2.000  
##                                                    Max.   :4.000  
##    studytime        failures      schoolsup famsup     paid     activities
##  Min.   :1.000   Min.   :0.0000   no :331   no :144   no :205   no :181   
##  1st Qu.:1.000   1st Qu.:0.0000   yes: 51   yes:238   yes:177   yes:201   
##  Median :2.000   Median :0.0000                                           
##  Mean   :2.037   Mean   :0.2016                                           
##  3rd Qu.:2.000   3rd Qu.:0.0000                                           
##  Max.   :4.000   Max.   :3.0000                                           
##  higher    romantic      famrel         freetime        goout      
##  no : 18   no :261   Min.   :1.000   Min.   :1.00   Min.   :1.000  
##  yes:364   yes:121   1st Qu.:4.000   1st Qu.:3.00   1st Qu.:2.000  
##                      Median :4.000   Median :3.00   Median :3.000  
##                      Mean   :3.937   Mean   :3.22   Mean   :3.113  
##                      3rd Qu.:5.000   3rd Qu.:4.00   3rd Qu.:4.000  
##                      Max.   :5.000   Max.   :5.00   Max.   :5.000  
##       Dalc            Walc           health         absences   
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   : 0.0  
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:3.000   1st Qu.: 1.0  
##  Median :1.000   Median :2.000   Median :4.000   Median : 3.0  
##  Mean   :1.482   Mean   :2.296   Mean   :3.573   Mean   : 4.5  
##  3rd Qu.:2.000   3rd Qu.:3.000   3rd Qu.:5.000   3rd Qu.: 6.0  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :45.0  
##        G1              G2              G3           alc_use     
##  Min.   : 2.00   Min.   : 4.00   Min.   : 0.00   Min.   :1.000  
##  1st Qu.:10.00   1st Qu.:10.00   1st Qu.:10.00   1st Qu.:1.000  
##  Median :12.00   Median :12.00   Median :12.00   Median :1.500  
##  Mean   :11.49   Mean   :11.47   Mean   :11.46   Mean   :1.889  
##  3rd Qu.:14.00   3rd Qu.:14.00   3rd Qu.:14.00   3rd Qu.:2.500  
##  Max.   :18.00   Max.   :18.00   Max.   :18.00   Max.   :5.000  
##   high_use        probability     prediction     
##  Mode :logical   Min.   :0.2883   Mode :logical  
##  FALSE:268       1st Qu.:0.3597   FALSE:272      
##  TRUE :114       Median :0.4170   TRUE :110      
##                  Mean   :0.4555                  
##                  3rd Qu.:0.5197                  
##                  Max.   :0.9398

Choosing the variables to study the relationship between high or low alcohol consumption.

The purpose of my analysis is to study the relationships between high/low alcohol consumption and some of the other variables in the data. To do this, I choose 4 interesting variables (failures, absences, famrel and higher) and state my hypotheses.

  1. The first variable, “failures”, gives the number of past class failures. I think there might be an association between higher alcohol consumption and more class failures.

  2. “absences” is the number of school absences. I am interested to find out whether those who drink more have more school absences.

  3. “famrel” describes the quality of family relationships. Maybe those who have poor family relationships consume more alcohol?

  4. “higher” tells whether the student wants to pursue higher education (yes or no). Maybe those who want to educate themselves further will study more and drink less.

Numerically and graphically exploring the distributions of my chosen variables and their relationships with alcohol consumption.

  1. First, is there some relationship between alcohol consumption and the number of past class failures?
# Numerically exploring the relationship between high_use and failures: a table and a geom_count plot

mytable <- table(high_use = alc$high_use, number_of_past_class_failures = alc$failures)

addmargins(mytable)
##         number_of_past_class_failures
## high_use   0   1   2   3 Sum
##    FALSE 244  12  10   2 268
##    TRUE   90  12   9   3 114
##    Sum   334  24  19   5 382
library(ggplot2)

failure1 <- ggplot(alc, aes(x=high_use, y = failures))
failure1 + geom_count()

In this table and plot I can see that the students with low alcohol use proportionally seem to have fewer class failures than the students with high alcohol use. As seen here, those who consume less alcohol more often have no past class failures.

  2. Is there some relationship between alcohol consumption and school absences?
#making a boxplot
g <-ggplot(alc, aes(x=high_use, y = absences))

g2 <- g + geom_boxplot() + ggtitle("Absences versus high alcohol use")
g2 

It looks like there might be a correlation between high use of alcohol and more school absences.

  3. How about the relationship between alcohol consumption and family relationships?
#making a boxplot

g_fam <- ggplot(alc, aes(x = high_use, y = famrel))

g_fam2 <- g_fam + geom_boxplot() 
g_fam2

It seems that those with low alcohol use rate their family relationships better (higher points). But which is the cause and which the effect? Do students with bad relationships at home start to use more alcohol, or does everything go well at home until they start to drink more, after which the relationships become more problematic?

  4. Relationship between alcohol consumption and future educational plans
table_plans <- table(high_use= alc$high_use, wants_high_education = alc$higher)
table_plans
##         wants_high_education
## high_use  no yes
##    FALSE   9 259
##    TRUE    9 105
round(prop.table(table_plans) * 100, 1)
##         wants_high_education
## high_use   no  yes
##    FALSE  2.4 67.8
##    TRUE   2.4 27.5

It seems that those who consume more alcohol less often have plans to pursue higher education.
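The table above shows proportions of the whole sample, so the two rows are hard to compare directly. Row-wise proportions (within each high_use group) make the comparison clearer. A sketch rebuilt from the counts shown above:

```r
# Rebuild the 2x2 table from the counts above and look at row proportions
table_plans <- matrix(c(9, 259, 9, 105), nrow = 2, byrow = TRUE,
                      dimnames = list(high_use = c("FALSE", "TRUE"),
                                      wants_high_education = c("no", "yes")))

# margin = 1 gives proportions within each high_use row
round(prop.table(table_plans, margin = 1) * 100, 1)
# among low users about 3.4 % answer "no"; among high users about 7.9 %
```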

Logistic regression

Now I want to statistically explore the relationship between my four chosen variables and the binary high/low alcohol consumption variable as the target variable. Here is the summary of my fitted model.

# Fit the model with glm. The first model (m) includes the intercept; in m2 the intercept is removed

m <- glm(high_use ~ failures + absences + famrel + higher, data = alc, family = "binomial")
summary(m)
## 
## Call:
## glm(formula = high_use ~ failures + absences + famrel + higher, 
##     family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3593  -0.7890  -0.6902   1.1782   1.8993  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.12562    0.74528  -0.169 0.866143    
## failures     0.43484    0.19618   2.217 0.026653 *  
## absences     0.08341    0.02272   3.671 0.000241 ***
## famrel      -0.22708    0.12502  -1.816 0.069322 .  
## higheryes   -0.36274    0.55177  -0.657 0.510924    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 436.43  on 377  degrees of freedom
## AIC: 446.43
## 
## Number of Fisher Scoring iterations: 4
coef(m)
## (Intercept)    failures    absences      famrel   higheryes 
## -0.12562333  0.43483926  0.08340567 -0.22707832 -0.36273561
m2 <- glm(high_use ~ failures + absences + famrel + higher - 1, data = alc, family = "binomial")
summary(m2)
## 
## Call:
## glm(formula = high_use ~ failures + absences + famrel + higher - 
##     1, family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3593  -0.7890  -0.6902   1.1782   1.8993  
## 
## Coefficients:
##           Estimate Std. Error z value Pr(>|z|)    
## failures   0.43484    0.19618   2.217 0.026653 *  
## absences   0.08341    0.02272   3.671 0.000241 ***
## famrel    -0.22708    0.12502  -1.816 0.069322 .  
## higherno  -0.12562    0.74528  -0.169 0.866143    
## higheryes -0.48836    0.51592  -0.947 0.343857    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 529.56  on 382  degrees of freedom
## Residual deviance: 436.43  on 377  degrees of freedom
## AIC: 446.43
## 
## Number of Fisher Scoring iterations: 4
coef(m2)
##    failures    absences      famrel    higherno   higheryes 
##  0.43483926  0.08340567 -0.22707832 -0.12562333 -0.48835893

Logistic regression model summary interpretation: As seen above, the p-value is below 0.05 for failures and absences, while famrel is borderline (p = 0.069). The coefficients of failures and absences are positive (failures 0.43, absences 0.08), meaning that more failures and more absences predict higher alcohol use. On the other hand, family relationships seem to have a negative effect (-0.23) on high alcohol use, meaning that better family relationships have a protective effect.

The future education plan (plans to get a higher education) also has a negative effect on high alcohol use (-0.36), but it is not statistically significant.

Presenting and interpreting the coefficients of the model as odds ratios and providing confidence intervals for them.

library(tidyverse)
## -- Attaching packages ------------------------------------------------------------------ tidyverse 1.2.0 --
## ✓ tibble  1.3.4     ✓ purrr   0.2.4
## ✓ tidyr   0.7.2     ✓ dplyr   0.7.4
## ✓ readr   1.1.1     ✓ stringr 1.2.0
## ✓ tibble  1.3.4     ✓ forcats 0.2.0
## -- Conflicts --------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(dplyr)

# Compute odds ratios and confidence intervals
OR <- coef(m) %>% exp
CI <- exp(confint(m))
## Waiting for profiling to be done...
cbind(OR, CI)
##                    OR     2.5 %   97.5 %
## (Intercept) 0.8819470 0.1995909 3.776889
## failures    1.5447147 1.0499664 2.282173
## absences    1.0869827 1.0418018 1.139025
## famrel      0.7968584 0.6230747 1.019051
## higheryes   0.6957704 0.2372414 2.121797

In this table one can see the coefficients of the model as odds ratios with their confidence intervals. The odds ratio for failures is 1.54 (CI 1.05-2.28) and for absences 1.09 (CI 1.04-1.14), meaning that each additional past class failure multiplies the odds of high alcohol consumption by 1.54, and each additional absence by 1.09. Better family relationships make high alcohol consumption less likely (OR 0.80 per one-point increase in famrel), although this interval (0.62-1.02) includes 1.

The higher education plan is not significant either, since the odds ratio is 0.70 and the CI 0.24-2.12 (the number 1 is included in the CI).

These findings largely support my earlier hypotheses, although only failures and absences reached statistical significance.
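The link between the logistic regression coefficients and the odds ratios above is simply exponentiation. A quick check using the coefficient estimates from the model summary:

```r
# Odds ratio = exp(coefficient), using the estimates from summary(m) above
exp(0.43484)   # failures: about 1.54
exp(0.08341)   # absences: about 1.09
exp(-0.22708)  # famrel:   about 0.80
```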

Exploring the predictive power of my model.

I'm using the variables which had a statistical relationship with high/low alcohol consumption according to my model and exploring its predictive power.

# Predict the probability of high use and add it as a new variable to the alc dataset. Then use the probability to make a prediction of high_use. Looking at the last 10 observations

probabilities <- predict(m, type= "response")
alc <- mutate(alc, probability = probabilities)
alc <- mutate(alc, prediction = probability > 0.5)
select(alc, failures, absences, famrel, high_use, probability, prediction) %>% tail(10)
##     failures absences famrel high_use probability prediction
## 373        1        0      4    FALSE   0.2765114      FALSE
## 374        1        7      5     TRUE   0.3531843      FALSE
## 375        0        1      5    FALSE   0.1764851      FALSE
## 376        0        6      4    FALSE   0.2898242      FALSE
## 377        1        2      5    FALSE   0.2646186      FALSE
## 378        0        2      4    FALSE   0.2262058      FALSE
## 379        2        2      2    FALSE   0.5234763       TRUE
## 380        0        3      1    FALSE   0.3857482      FALSE
## 381        0        4      2     TRUE   0.3523118      FALSE
## 382        0        2      4     TRUE   0.2262058      FALSE
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table %>% addmargins
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.65445026 0.04712042 0.70157068
##    TRUE  0.25392670 0.04450262 0.29842932
##    Sum   0.90837696 0.09162304 1.00000000
gpred <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))

gpred + geom_point()

# the training error

# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) >0.5
  mean(n_wrong)
}

The proportion of incorrectly classified observations (i.e. the training error) is 0.30 (30%). I think this percentage is pretty high; I can't really call my model very good when one third of the observations are incorrectly classified.
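To sanity-check what the loss function does, here it is applied to toy vectors (the function is redefined here so the sketch is self-contained; these are not the course data). A prediction counts as wrong when the probability falls on the wrong side of 0.5:

```r
# Mean prediction error: proportion of probabilities on the wrong side of 0.5
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5  # TRUE when prediction is wrong
  mean(n_wrong)                       # proportion of wrong predictions
}

# Two of these four toy predictions (0.3 for class 1, 0.8 for class 0) are wrong
loss_func(class = c(1, 0, 1, 0), prob = c(0.9, 0.2, 0.3, 0.8))  # 0.5
```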

Then I also ran a 10-fold cross-validation, which gave a similar number, about 0.31 (31%), as the average proportion of wrong predictions.

loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.3010471
# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)

# average proportion of wrong predictions in the cross-validation
cv$delta[1]
## [1] 0.3141361

Chapter 4. Clustering and classification


About the data

In this exercise we are examining the Boston dataset from R MASS package. This dataset is about housing values in suburbs of Boston. The data frame has 506 observations and 14 variables.

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
# Calling the Boston dataset from the MASS package
dim(Boston)
## [1] 506  14
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

So we have 506 suburbs of Boston, and here are the explanations of the different variables:

  • crim: per capita crime rate by town
  • zn: proportion of residential land zoned for lots over 25,000 sq.ft.
  • indus: proportion of non-retail business acres per town
  • chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
  • nox: nitrogen oxides concentration (parts per 10 million)
  • rm: average number of rooms per dwelling
  • age: proportion of owner-occupied units built prior to 1940
  • dis: weighted mean of distances to five Boston employment centres
  • rad: index of accessibility to radial highways
  • tax: full-value property-tax rate per $10,000
  • ptratio: pupil-teacher ratio by town
  • black: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
  • lstat: lower status of the population (percent)
  • medv: median value of owner-occupied homes in $1000s

# Loading the packages I might use in this exercise
library(ggplot2)
library(GGally)
library(tidyverse)
library(corrplot)
## corrplot 0.84 loaded
# Checking the summary of the Boston dataset
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Here is the summary of the data. Chas is a binary variable. The other variables, except for crim and zn, seem to be roughly normally distributed (mean and median are more or less close to each other).

Looking at the correlations

Let’s also make a graph.

#graphical overview
pairs(Boston)


# Making a correlation matrix and drawing a correlation plot to visualise it better and for easier interpretation

cormatrix <- cor(Boston)
cormatrix %>% round(digits=2)
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47
##         ptratio black lstat  medv
## crim       0.29 -0.39  0.46 -0.39
## zn        -0.39  0.18 -0.41  0.36
## indus      0.38 -0.36  0.60 -0.48
## chas      -0.12  0.05 -0.05  0.18
## nox        0.19 -0.38  0.59 -0.43
## rm        -0.36  0.13 -0.61  0.70
## age        0.26 -0.27  0.60 -0.38
## dis       -0.23  0.29 -0.50  0.25
## rad        0.46 -0.44  0.49 -0.38
## tax        0.46 -0.44  0.54 -0.47
## ptratio    1.00 -0.18  0.37 -0.51
## black     -0.18  1.00 -0.37  0.33
## lstat      0.37 -0.37  1.00 -0.74
## medv      -0.51  0.33 -0.74  1.00
corrplot(cormatrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)

There is a strong negative correlation (big red circle) between dis and nox, dis and age, and dis and indus. Moving further from the Boston employment centres, the nitrogen oxides concentration, the proportion of owner-occupied units built prior to 1940 and the proportion of non-retail business acres all go down. This makes sense.

Also, lower status of the population (lstat) and the median value of owner-occupied homes (medv) have a strong negative correlation. When the percentage of lower status of the population gets bigger, the median value of owner-occupied homes in $1000s gets smaller. This is also understandable.

Rad and tax have a strong positive correlation: when the index of accessibility to radial highways rises, the full-value property-tax rate per $10,000 also rises. Why not?

Getting ready for the analysis

Let’s move further with the analysis.

I need to standardise the dataset so that all variables are on the same scale (mean 0 and standard deviation 1). Then I print the summary of the scaled data.
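What scale() does is subtract the column mean and divide by the column standard deviation. A quick self-contained check on a toy vector (not the Boston data):

```r
# Standardisation by hand: (x - mean(x)) / sd(x), which is what scale() does
x <- c(2, 4, 6, 8)
manual <- (x - mean(x)) / sd(x)

# scale() returns a one-column matrix, so flatten it before comparing
all.equal(as.vector(scale(x)), manual)  # TRUE
```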

#Standardising the dataset
boston_scaled <- scale(Boston)

#printing out summaries of the scaled data
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
#My boston_scaled data is a matrix and I make it a data frame for later use
class(boston_scaled)
## [1] "matrix"
boston_scaled<- as.data.frame(boston_scaled)

Now we have scaled the data, and as seen in the summary, all the means are exactly 0 and the variables are on a common scale. Scaling does not make the variables normally distributed, but with its help this data can be fitted in a model.
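To be precise about what scale() does: it subtracts each column's mean and divides by its standard deviation. A quick check on one column:

```r
library(MASS)                           # for the Boston data

x <- Boston$crim
manual <- (x - mean(x)) / sd(x)         # centre, then divide by the standard deviation
all.equal(as.numeric(scale(x)), manual) # TRUE: this is all scale() does
```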

Next I will change the continuous crime rate variable in my data set to be a categorical variable. I want to cut the crim variable by quantiles to get the high, low and middle rates of crime into their own categories. I drop the old crim variable from the dataset and replace it with the new crime variable.

#create a quantile vector
bins <- quantile(boston_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
#and create a categorical variable crime

crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))


table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
library(dplyr)

boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
summary(boston_scaled)
##        zn               indus              chas              nox         
##  Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723   Min.   :-1.4644  
##  1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723   1st Qu.:-0.9121  
##  Median :-0.48724   Median :-0.2109   Median :-0.2723   Median :-0.1441  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723   3rd Qu.: 0.5981  
##  Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648   Max.   : 2.7296  
##        rm               age               dis               rad         
##  Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658   Min.   :-0.9819  
##  1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049   1st Qu.:-0.6373  
##  Median :-0.1084   Median : 0.3171   Median :-0.2790   Median :-0.5225  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617   3rd Qu.: 1.6596  
##  Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566   Max.   : 1.6596  
##       tax             ptratio            black             lstat        
##  Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033   Min.   :-1.5296  
##  1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049   1st Qu.:-0.7986  
##  Median :-0.4642   Median : 0.2746   Median : 0.3808   Median :-0.1811  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332   3rd Qu.: 0.6024  
##  Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406   Max.   : 3.5453  
##       medv              crime    
##  Min.   :-1.9063   low     :127  
##  1st Qu.:-0.5989   med_low :126  
##  Median :-0.1449   med_high:126  
##  Mean   : 0.0000   high    :127  
##  3rd Qu.: 0.2683                 
##  Max.   : 2.9865
dim(boston_scaled)
## [1] 506  14

Now I need to divide the dataset to train and test sets, so that 80% of the data belongs to the train set.
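One thing to note: sample() draws a different split on every run, so for a reproducible split one would fix the random seed first. A minimal sketch (the seed value 123 is just an example):

```r
set.seed(123)                        # fix the RNG so the split is reproducible
n   <- 506                           # number of rows, as in dim(boston_scaled)
ind <- sample(n, size = floor(n * 0.8))
length(ind)                          # 404 row indices go to the train set
```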

#Here I make the train and the test sets. I choose 80% of the observations to the train set and the rest to the test set
dim(boston_scaled)
## [1] 506  14
n <- nrow(boston_scaled)
n
## [1] 506
ind <- sample(n, size = n * 0.8)

#dim() returns NULL for a plain vector; length() gives the number of sampled rows
length(ind)
## [1] 404
#create the train set

train <- boston_scaled[ind,]
str(train)
## 'data.frame':    404 obs. of  14 variables:
##  $ zn     : num  -0.487 -0.487 -0.487 -0.487 -0.487 ...
##  $ indus  : num  1.01 1.01 1.01 1.01 1.01 ...
##  $ chas   : num  -0.272 -0.272 -0.272 -0.272 3.665 ...
##  $ nox    : num  1.599 0.253 0.218 1.194 1.409 ...
##  $ rm     : num  -0.0934 -0.1745 0.2169 0.1713 3.5515 ...
##  $ age    : num  1.116 1.024 0.228 0.974 0.509 ...
##  $ dis    : num  -0.85 -0.755 -0.427 -1.006 -0.898 ...
##  $ rad    : num  1.66 1.66 1.66 1.66 1.66 ...
##  $ tax    : num  1.53 1.53 1.53 1.53 1.53 ...
##  $ ptratio: num  0.806 0.806 0.806 0.806 0.806 ...
##  $ black  : num  0.4274 -0.5905 0.4019 0.4406 -0.0233 ...
##  $ lstat  : num  0.551 1.603 0.239 0.941 -1.031 ...
##  $ medv   : num  -0.4494 -1.0039 0.0725 -1.0909 -0.0688 ...
##  $ crime  : Factor w/ 4 levels "low","med_low",..: 4 4 3 4 3 3 2 1 2 1 ...
#create the test set
test <- boston_scaled[-ind,]
str(test)
## 'data.frame':    102 obs. of  14 variables:
##  $ zn     : num  -0.4872 0.0487 -0.4872 -0.4872 -0.4872 ...
##  $ indus  : num  -1.306 -0.476 -0.437 -0.437 -0.437 ...
##  $ chas   : num  -0.272 -0.272 -0.272 -0.272 -0.272 ...
##  $ nox    : num  -0.834 -0.265 -0.144 -0.144 -0.144 ...
##  $ rm     : num  1.015 -0.563 -0.794 -0.455 -0.268 ...
##  $ age    : num  -0.8091 -1.0507 0.0329 0.7327 1.0063 ...
##  $ dis    : num  1.076671 0.786365 0.000692 0.103175 -0.016737 ...
##  $ rad    : num  -0.752 -0.522 -0.637 -0.637 -0.637 ...
##  $ tax    : num  -1.105 -0.577 -0.601 -0.601 -0.601 ...
##  $ ptratio: num  0.113 -1.504 1.175 1.175 1.175 ...
##  $ black  : num  0.416 0.371 0.375 0.393 -1.187 ...
##  $ lstat  : num  -1.36 0.428 -0.192 0.165 1.076 ...
##  $ medv   : num  1.1816 -0.0906 -0.4711 -0.3189 -0.9821 ...
##  $ crime  : Factor w/ 4 levels "low","med_low",..: 1 2 3 3 3 1 2 2 2 2 ...

Linear discriminant analysis (LDA)

Now I’m making a linear discriminant analysis on the train set. I use the categorical crime rate as the target variable and all the other variables are predictor variables. I draw the LDA (bi)plot of the model.

# fit the linear discriminant analysis on the train set: crime as the target variable and all the other variables as predictors

library(MASS)

lda.fit <- lda(formula = crime ~ ., data = train)


lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# drawing a plot of the lda results
plot(lda.fit, dimen = 2, col=classes, pch=classes)
lda.arrows(lda.fit, myscale = 1)

Then I predict the classes with the LDA model on my test data and cross tabulate the results with the crime categories from the test set.

#saving the crime categories from the test set 
correct_classes <- test$crime

library(dplyr)


#removing the categorical crime variable from the test dataset

test <- dplyr::select(test, -crime)

#Predicting the classes with the LDA model on the test data
lda.pred <- predict(lda.fit, newdata = test)

#cross tabulating the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       17       3        1    0
##   med_low    4      17        3    0
##   med_high   0       7       20    1
##   high       0       0        0   29

Here I can see that of my 21 low crime rates 17 were predicted in the right class, but 3 were predicted as med_low and 1 as med_high; there were also some errors predicting the med_low and med_high crime rates. However, all 29 high crime rates were predicted in the right class. My model works well predicting the high crime rates, but it makes some errors predicting the other classes. The same phenomenon was visible in the train set plot.
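The overall accuracy can be read straight off the cross tabulation: the diagonal holds the correct predictions. A sketch using the counts printed above:

```r
# confusion matrix as printed above (rows = correct, cols = predicted)
classes <- c("low", "med_low", "med_high", "high")
conf <- matrix(c(17,  3,  1,  0,
                  4, 17,  3,  0,
                  0,  7, 20,  1,
                  0,  0,  0, 29),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct = classes, predicted = classes))

sum(diag(conf)) / sum(conf)   # share of correct predictions: 83/102, about 0.81
```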

Distance measures and clustering

Next I move on to clustering and distance measures. I use the Euclidean distance, which is possibly the most common one. K-means is an old and widely used clustering method. kmeans() computes the distances automatically, but I have to choose the number of clusters. I tried to make the model with 4 clusters, but it seems to me that 3 clusters works better.
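The Euclidean distance that dist() computes is just the square root of the summed squared coordinate differences. A toy check with two points:

```r
a <- c(1, 2, 3)
b <- c(4, 6, 3)
sqrt(sum((a - b)^2))   # 5, by the definition of Euclidean distance
dist(rbind(a, b))      # the same value from dist()
```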


#Reloading the Boston dataset and standardising it (the variables need to be on a common scale for distance-based clustering)
dim(Boston)
## [1] 506  14
scale_Boston2 <- scale(Boston)

scale_Boston2 <- as.data.frame(scale_Boston2)


#Calculating the distances: Euclidean distance

dist_eu <- dist(scale_Boston2)
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
# k-means clustering
km <- kmeans(scale_Boston2, centers = 3)


# plotting the Boston dataset with clusters
pairs(scale_Boston2, col = km$cluster)

pairs(scale_Boston2[1:6], col = km$cluster)

pairs(scale_Boston2[7:13], col = km$cluster)

Next I investigate what is the optimal number of clusters. There are many ways to find the optimal number of clusters but here I will be using the Total of within cluster sum of squares (WCSS) and visualising the result with a plot.

# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(scale_Boston2, k)$tot.withinss})

# visualize the results (qplot comes from ggplot2)
library(ggplot2)
qplot(x = 1:k_max, y = twcss, geom = 'line')

The optimal number of clusters is where the total WCSS drops radically, and above I can see that this happens around k = 2. So the optimal number of clusters would be 2. Next I run the algorithm again with two clusters.

# k-means clustering
km <- kmeans(scale_Boston2, centers = 2)

# plot the Boston dataset with clusters
pairs(scale_Boston2, col = km$cluster)

pairs(scale_Boston2[1:8], col = km$cluster)

pairs(scale_Boston2[6:13], col = km$cluster)
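The elbow read off the WCSS plot can also be checked numerically: the candidate k is the one right after the largest drop in total WCSS. A sketch on simulated data with two clearly separated groups (so the true number of clusters is 2):

```r
set.seed(1)
# two well-separated clusters of 50 two-dimensional points each
sim <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 6), ncol = 2))

# total within-cluster sum of squares for k = 1..6
twcss <- sapply(1:6, function(k) kmeans(sim, centers = k, nstart = 10)$tot.withinss)

which.max(-diff(twcss)) + 1   # k after the biggest drop: 2
```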
